1. INTRODUCTION

Tomato is consumed all over the world in many forms, such as vegetable curries, jams, sauces, chutneys,
salads, and drinks. India ranks second in the world in tomato production [1], but its yield per hectare is
much lower than that of other countries. The reason is that the tomato crop in India is affected by diseases
such as bacterial wilt, early blight, and leaf curl virus, which cause losses of up to 70-100% of production
(Indian Institute of Horticultural Research) [1]. Various institutes are trying to develop hybrid
disease-resistant varieties to solve the problem, but these efforts are time-consuming and costly. With
advances in technology, better solutions are possible: predicting diseases at an early stage and applying the
correct remedy at the appropriate time can save a large share of production. The major tomato-producing
states of India are shown in Figure 1: Bihar and Uttar Pradesh in the northern region, Maharashtra in the
western region, Karnataka and Andhra Pradesh in the southern region, Assam and Orissa in the eastern region,
and Madhya Pradesh in the central region [2].

Tomato is generally known as a warm-season crop that gives the best quality at temperatures between
21 and 24 °C. Temperatures above 30 °C, frost, and humidity affect the plant tissues, which results in the
development of various fungal, bacterial, or viral diseases. Hence, detecting crop disease at an early stage
is an active research area. The major diseases of tomato that affect the crop are shown in Figure 2.

Farmers detect disease either by inspecting parts of the crop with the naked eye (which requires
expertise) or by sending samples to testing centers (which is time-consuming). Techniques used to detect crop
diseases include thermography (which detects the change in surface temperature caused by reduced
transpiration, but cannot identify the disease type), fluorescence (which measures the change in
photosynthetic activity and chlorophyll to detect the pathogen, but is of limited practical use), and gas
chromatography (which uses volatile organic compounds (VOCs) to detect the nature and type of infection, but
suffers from the lack of pre-collected VOC samples). A better option is therefore to use machine learning
algorithms.




Figure 1. Major tomato-producing states in India [2]

Figure 2. Sample of major diseases of tomato in India (NHB, India database)


Several earlier studies show the capability of machine learning algorithms to identify objects in
various sectors such as retail, human behavior, face recognition [3], facial expressions [4], handwriting
recognition [5], intrusion detection [6], movie recommendation [7], and food segmentation [8]. The health
sector also uses machine learning to identify, detect, or predict various diseases, including diabetes
prediction, cancer detection [9], [10], heart disease, skin problems, Parkinson's disease identification [11],
COVID-19 from chest X-rays [12], and many more. Crop disease classification can also be done efficiently and
effectively using machine learning. In our study we found that machine learning algorithms have been applied
to various agricultural tasks, such as finding nutrient deficiency in maize [13] and groundnut leaf disease
detection using a convolutional neural network (CNN) and bag of features (BOF) with speeded-up robust
features (SURF) [14].

During this study, we found that machine learning algorithms can be applied to identify tomato leaf
diseases. The support vector machine (SVM) algorithm was used to identify black spots, cankers, and melanose
by extracting color, shape, and texture features, with segmentation used to detect the diseases [15].
PCA and one-hot encoding were used by Reddy, and an accuracy of more than 85% was achieved [16]. Data
augmentation techniques such as contrast, random zoom and crop, and central zoom were used to identify early
blight, late blight, and leaf mold with a CNN and 5-fold cross-validation, and an accuracy of 98% was
achieved [17]. The LabelImg tool was used to annotate images, and a faster region-based CNN (Faster R-CNN)
with ResNet 50 achieved an average accuracy of 81% for early blight, leaf curl, Septoria, and bacterial leaf
spot diseases [18]. A CNN with 8 hidden layers achieved an accuracy of 98.4% [19]. Pre-processing methods
such as horizontal image flips and rotations were used to increase the data sixfold, and a comparative study
of MobileNetV2, Xception, and MobileNetV3 found that NetMobile was the best among them [20]. YOLOv3 was
improved by using 53 convolution layers and 5 max-pooling layers, and an accuracy of more than 92% was
achieved [21]. Early and late blight, leaf mold, bacterial leaf spot, and yellow leaf curl diseases were
studied in China using ABCK [22]. Early blight, bacterial spot, septoria leaf spot, and iron chlorosis were
studied a few years ago, with 100% accuracy for early blight [23]. A multiple linear regression method was
used to detect the diseases [24]. Color descriptors and textures were used to extract features, and a
comparative study of KNN, ANN, random forest (RF), naive Bayes (NB), and SVM found that RF achieved better
accuracy than the others [25]. Color histograms, Hu moments, local binary patterns, and Haralick features
were used for training and testing with RF and DT, and RF again showed better accuracy [26]. A YOLO model
with a 90%-10% train-test split achieved an accuracy of 76% [27]. Features such as contrast, energy,
correlation, mean homogeneity, entropy, variance, standard deviation, root mean square, skewness, and
kurtosis were extracted using image processing tools; the extracted features were divided into 5 subsets, and
the resulting vectors were classified using a back-propagation NN [28]. In another study, image sizes were
changed and noise was filtered using the Wiener filter. Segmentation was done using a modified K-means image
segmentation algorithm. Contrast, energy, entropy, homogeneity, and uniformity features were extracted from
the segmented image using the grey level co-occurrence matrix (GLCM), and leaves were classified using SVM
and an adaptive neuro-fuzzy inference system (ANFIS) [29]. Leaf curl virus was reported to affect 70% of the
crop [30], [31]. A probabilistic Bayesian learning approach was reported to give better results than other
optimized models [32]. The K-means algorithm was used to locate the infected areas in the leaves, and
multi-SVM was then used for classification [33]. The review shows that the majority of the work has been done
with traditional machine learning algorithms. We therefore carry out a comparative study of traditional
machine learning algorithms and a deep learning CNN to find out whether the deep CNN provides better results.


2. METHOD 
2.1. Data collection 

Tomato leaf images were collected from PlantVillage, a publicly available repository. Four disease
categories, together with healthy leaves, were chosen for this study:

a. Tomato mosaic virus (TMV): 1,000 images were collected. During the pandemic, a large tomato crop loss
occurred in Maharashtra (Pune, Nashik, Ahmednagar, Satara) due to tomato viral diseases, and farmers lost
more than 60 percent of the crop [22]. The virus affects various parts of the plant, such as the fruit,
stem, and leaves. The fruit may turn brown or yellow and be reduced in size. Dark brown patches appear on
the stem, and the leaves turn yellow and green with a mottled, mosaic appearance.

b. Tomato septoria leaf spot (TSLS): 1,771 images were collected. Small round or irregular spots with grey
centers and dark margins appear on the leaves. The disease generally affects the lower leaves of the plant.
Flowers and stems are sometimes attacked, but the fruits are rarely attacked, hence the name septoria leaf
spot.

c. Tomato target spot (TTS): 1,404 images were collected. This disease is highly prevalent in West Bengal,
mainly in the Gangetic alluvial region (Adam Kamei). Lesions ranging from small and dark to large and light
brown appear on leaves and fruits.

d. Tomato yellow leaf curl virus (TYLCV): 5,357 images were collected. This disease is more prevalent in
Tamil Nadu in India. Its major symptoms are curling of leaves, reduced leaf size, downward rolling of the
plant, and yellowing of new leaves.

e. Tomato healthy leaves (TH): 1,591 images were collected. Healthy leaves are medium-sized with soft fuzz.
Once the model is trained on the above-mentioned diseases, it can be compared against the healthy leaves to
assess its accuracy.
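
The class sizes listed above are uneven, which matters when the per-class results are read later. The following is a minimal sketch of that arithmetic, using only the image counts stated in the list; the variable names are illustrative.

```python
# Image counts per class, as stated in the list above.
counts = {"TMV": 1000, "TSLS": 1771, "TTS": 1404, "TYLCV": 5357, "TH": 1591}

total = sum(counts.values())

# Share of the whole data set contributed by each class.
shares = {label: n / total for label, n in counts.items()}

# Print the classes from largest to smallest.
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:6s} {counts[label]:5d} images  {share:6.1%}")
```

TYLCV contributes close to half of the 11,123 images, so a stratified split is a sensible way to keep the same class ratio in the training and test partitions.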


2.2. Image annotation 

Images were annotated using an image annotation tool, and a knowledge base of the above five classes
was created. The image annotation tool is used to annotate, or label, the images. After annotation, the
features of each image can be recognized by the machine learning model.


2.3. Training 
The machine learning models were trained with an 80%-20% train-test split using the randomized stratified
method. Training was done using the following algorithms:

a. Random forest with the following properties: i) number of trees: 10, ii) split only subsets greater than 5,
and iii) repeat train/test: 10.

b. Naive Bayes with an 80%-20% train-test split.

c. SVM with the following properties: i) regression loss epsilon: 0.10, ii) kernel: RBF, iii) iteration
limit: 100, and iv) numerical tolerance: 0.0010.

d. CNN with the pre-trained Inception v3 model and the following properties: i) neurons in hidden layer: 100,
ii) activation: ReLU, iii) solver: Adam, and iv) maximal number of iterations, with replicable training: 3,500.
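
The randomized stratified 80%-20% split described above can be sketched as follows. This is an illustrative pure-Python version, not the tool used in the study; the function name `stratified_split`, the seeds, and the rounding rule are assumptions, and the labels are synthetic placeholders built from the class sizes in Section 2.1.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that every class keeps roughly
    the same proportion in both partitions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # randomize within the class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels with the class sizes from Section 2.1.
sizes = {"TMV": 1000, "TSLS": 1771, "TTS": 1404, "TYLCV": 5357, "TH": 1591}
labels = [label for label, n in sizes.items() for _ in range(n)]

# Repeat the split 10 times with different seeds, mirroring the
# "repeat train/test: 10" setting listed for random forest.
test_counts = defaultdict(int)
for repeat in range(10):
    train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=repeat)
    for idx in test_idx:
        test_counts[labels[idx]] += 1

print(dict(test_counts))
```

With these class sizes, ten repeats give 2,000 TMV, 3,540 TSLS, 2,810 TTS, and 3,180 TH test samples, which is consistent with the row totals of the confusion matrices in Section 2.4. The TYLCV count comes out at 10,710 here against 10,720 reported, presumably because of a different rounding rule.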


2.4. Results 

The evaluation results show that the CNN performs better than the other models. The details are given
below:
a. Evaluation results

The evaluation results of the experiment are shown in Table 1, where NN denotes the CNN with the
Inception v3 model. Classification accuracy (CA), precision, and F1 score are higher for the NN than for
SVM, RF, and NB.


Table 1. Experimental results
Model   AUC     CA      F1      Precision   Recall
SVM     0.997   0.963   0.963   0.963       0.963
RF      0.977   0.877   0.876   0.876       0.877
NN      0.999   0.981   0.981   0.981       0.981
NB      0.945   0.775   0.775   0.796       0.766
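
To show how the columns of Table 1 relate to a confusion matrix, the sketch below recomputes CA and support-weighted precision, recall, and F1 from the NN confusion matrix reported in Table 8. It rests on three assumptions that the text does not state: rows are actual classes and columns are predicted classes, the averaging is support-weighted, and the one illegible cell (TTS row, TMV column) is 7, the value implied by the row and column totals. The function name is illustrative.

```python
LABELS = ["TH", "TMV", "TSLS", "TTS", "TYLCV"]

# NN confusion matrix from Table 8 (rows: actual, columns: predicted; assumed).
NN = [
    [3136,    6,    2,   35,     1],
    [   3, 1963,   16,    3,    15],
    [   9,   36, 3403,   78,    14],
    [  48,    7,   81, 2646,    28],   # 7 is inferred from the row/column totals
    [   1,    7,    7,   28, 10677],
]


def weighted_metrics(matrix):
    """Return (accuracy, precision, recall, f1), support-weighted over classes."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total

    precision = recall = f1 = 0.0
    for i in range(n):
        tp = matrix[i][i]
        support = sum(matrix[i])                          # actual samples of class i
        predicted = sum(matrix[r][i] for r in range(n))   # samples predicted as class i
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


ca, prec, rec, f1 = weighted_metrics(NN)
print(f"CA={ca:.3f} precision={prec:.3f} recall={rec:.3f} F1={f1:.3f}")
```

Under these assumptions all four values round to 0.981, in line with the NN row of Table 1.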


b. Confusion matrices 

The confusion matrices of all the algorithms applied in this experimental study are given in the
following tables. Table 2 shows the confusion matrix of RF, and Table 3 gives the per-disease accuracy
derived from it. The NB confusion matrix is shown in Table 4, with the corresponding accuracies in Table 5.
The SVM confusion matrix is shown in Table 6, with the corresponding accuracies in Table 7. Finally, the
confusion matrix of the CNN with the Inception v3 model is shown in Table 8, with the corresponding
accuracies in Table 9. The experiment shows that yellow leaf curl virus is classified more accurately than
the other classes by these algorithms, as its data set was larger than the others.


Table 2. Confusion matrix for RF
         TH     TMV    TSLS   TTS    TYLCV   Σ
TH       2747   22     82     278    51      3180
TMV      66     1494   220    81     139     2000
TSLS     91     162    2816   274    197     3540
TTS      294    67     300    2022   127     2810
TYLCV    18     40     108    122    10432   10720
Σ        3216   1785   3526   2777   10946   22250


Table 3. Disease accuracy % of RF
TH (%)   TMV (%)   TSLS (%)   TTS (%)   TYLCV (%)
86.38    74.7      79.54      71.96     97.31
Table 4. Confusion matrix for NB
         TH     TMV    TSLS   TTS    TYLCV   Σ
TH       1976   139    161    873    31      3180
TMV      99     1563   214    86     38      2000
TSLS     214    432    2399   379    116     3540
TTS      384    115    245    2048   18      2810
TYLCV    142    446    575    506    9051    10720
Σ        2815   2695   3594   3892   9254    22250


Table 5. Disease accuracy % of NB
TH (%)   TMV (%)   TSLS (%)   TTS (%)   TYLCV (%)
62.14    78.15     67.77      72.88     84.43


Table 6. Confusion matrix for SVM
         TH     TMV    TSLS   TTS    TYLCV   Σ
TH       3051   5      8      116    0       3180
TMV      3      1911   67     5      14      2000
TSLS     14     61     3173   239    53      3540
TTS      42     17     97     2596   58      2810
TYLCV    0      6      8      20     10686   10720
Σ        3110   2000   3353   2976   10811   22250


Table 7. Disease accuracy % of SVM
TH (%)   TMV (%)   TSLS (%)   TTS (%)   TYLCV (%)
95.94    95.55     89.63      92.38     99.68


Table 8. Confusion matrix of NN with Inception v3 model
         TH     TMV    TSLS   TTS    TYLCV   Σ
TH       3136   6      2      35     1       3180
TMV      3      1963   16     3      15      2000
TSLS     9      36     3403   78     14      3540
TTS      48     7      81     2646   28      2810
TYLCV    1      7      7      28     10677   10720
Σ        3197   2019   3509   2790   10735   22250


Table 9. Disease accuracy % of CNN
TH (%)   TMV (%)   TSLS (%)   TTS (%)   TYLCV (%)
98.61    98.15     96.13      94.16     99.59


c. Accuracy 

The per-disease accuracy of each algorithm is given in Table 10. SVM also shows acceptable
accuracies, but it diagnoses septoria leaf spot less accurately than the other classes.


Table 10. Disease accuracy % of RF, NB, SVM, NN
Algorithm   TH (%)   TMV (%)   TSLS (%)   TTS (%)   TYLCV (%)
RF          86.38    74.7      79.54      71.96     97.31
NB          62.14    78.15     67.77      72.88     84.43
SVM         95.94    95.55     89.63      92.38     99.68
NN          98.61    98.15     96.13      94.16     99.59
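
The per-disease figures in the accuracy tables correspond to the diagonal entry of a confusion matrix divided by its row total. The following is a minimal sketch using the SVM matrix from Table 6, assuming that rows are the actual classes; the helper name `per_class_accuracy` is illustrative.

```python
LABELS = ["TH", "TMV", "TSLS", "TTS", "TYLCV"]

# SVM confusion matrix from Table 6 (rows: actual, columns: predicted; assumed).
SVM = [
    [3051,    5,    8,  116,     0],
    [   3, 1911,   67,    5,    14],
    [  14,   61, 3173,  239,    53],
    [  42,   17,   97, 2596,    58],
    [   0,    6,    8,   20, 10686],
]


def per_class_accuracy(matrix, labels):
    """Percentage of each actual class that was predicted correctly."""
    return {
        label: 100.0 * matrix[i][i] / sum(matrix[i])
        for i, label in enumerate(labels)
    }


svm_acc = per_class_accuracy(SVM, LABELS)
weakest = min(svm_acc, key=svm_acc.get)

for label in LABELS:
    print(f"{label:6s} {svm_acc[label]:6.2f}%")
print("weakest class:", weakest)
```

TSLS comes out lowest for SVM, which is the weakness noted in the text.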


d. ROC curve analysis 

The receiver operating characteristic (ROC) curve is used to compare classification models and to
evaluate their accuracy. It is a two-dimensional plot with the true positive rate (TPR) on the Y-axis and
the false positive rate (FPR) on the X-axis, where:


$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{2}$$


The area under the curve (AUC) summarizes each ROC curve in a single number that can be used to compare the
accuracy of the classification models under consideration; the classifier with the greatest AUC is
preferable. Figure 3 lists the classifiers. The ROC curves are shown in Figure 4 (TH), Figure 5 (TMV),
Figure 6 (TSLS), Figure 7 (TTS), and Figure 8 (TYLCV). The CNN has a higher AUC than the other models.
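
To make the construction concrete, the sketch below builds a one-vs-rest ROC curve from classifier scores by sweeping a decision threshold, applying the TPR and FPR definitions in (1) and (2), and integrating with the trapezoidal rule. The scores and labels are made-up toy values, not data from this study, and the function names are illustrative.

```python
def roc_points(scores, is_positive):
    """Return ROC points (FPR, TPR) obtained by lowering the decision
    threshold through every distinct score."""
    pos = sum(is_positive)
    neg = len(is_positive) - pos
    # Sort samples from the highest score to the lowest.
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        # FPR = FP / (FP + TN) and TPR = TP / (TP + FN)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy one-vs-rest example: scores for "TYLCV" versus the rest (hypothetical values).
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30, 0.20, 0.10]
is_tylcv = [True, True, False, True, True, False, False, False]

curve = roc_points(scores, is_tylcv)
toy_auc = auc(curve)
print(f"AUC = {toy_auc:.3f}")
```

In this toy case one negative sample outranks two positives, so the AUC is 0.875 rather than 1.0; a classifier that ranks every positive above every negative reaches an AUC of 1.0.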


Figure 3. Classifiers list (legend: random forest, naive Bayes, SVM, neural network)
Figure 4. ROC curve of tomato healthy

Figure 5. ROC curve of TMV

Figure 6. ROC curve of TSLS

Figure 7. ROC curve of TTS

Figure 8. ROC curve of TYLCV

(Axes in Figures 4-8: TP rate (sensitivity) on the Y-axis, FP rate (1-specificity) on the X-axis.)
e. Misclassified samples

Figures 9-13 show misclassified samples of TH, TMV, TSLS, TTS, and TYLCV. TSLS and TTS are
misclassified more often than TMV and TYLCV, and TSLS is mostly misclassified as TTS and vice versa. In the
future, this could be reduced by using image processing techniques and by enlarging the data sets for TSLS,
TTS, and TMV.
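
The confusion between TSLS and TTS can be read directly from the off-diagonal cells of a confusion matrix. The following is a minimal sketch using the NN matrix from Table 8, assuming rows are actual and columns are predicted classes, with the one illegible cell (TTS row, TMV column) taken as 7, the value implied by the row and column totals. The helper name `top_confusions` is illustrative.

```python
LABELS = ["TH", "TMV", "TSLS", "TTS", "TYLCV"]

# NN confusion matrix from Table 8 (rows: actual, columns: predicted; assumed).
NN = [
    [3136,    6,    2,   35,     1],
    [   3, 1963,   16,    3,    15],
    [   9,   36, 3403,   78,    14],
    [  48,    7,   81, 2646,    28],   # 7 is inferred from the row/column totals
    [   1,    7,    7,   28, 10677],
]


def top_confusions(matrix, labels, k=3):
    """Return the k largest off-diagonal cells as (actual, predicted, count)."""
    cells = [
        (labels[i], labels[j], matrix[i][j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j
    ]
    return sorted(cells, key=lambda cell: cell[2], reverse=True)[:k]


worst = top_confusions(NN, LABELS)
for actual, predicted, count in worst:
    print(f"{actual} predicted as {predicted}: {count}")
```

The two largest cells are TTS predicted as TSLS (81) and TSLS predicted as TTS (78), which matches the observation that these two classes are mainly confused with each other.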
Figure 10. Sample of TMV leaves misclassified

Figure 12. Sample of TTS leaves misclassified

Figure 13. Sample of TYLCV misclassified


3. CONCLUSION

A comparative experimental study was carried out for four machine learning algorithms, namely SVM, NB,
RF, and CNN, on the detection of four tomato diseases: TMV, TTS, TYLCV, and TSLS. SVM, RF, FFNN, and NB
have been used by many authors. The classification techniques reviewed performed differently on different
datasets and with different tools, and each has its own pros and cons when applied to disease
classification. The results of this experiment show that deep learning methods, CNN and its variants, detect
crop diseases more accurately. In the future, deep CNNs combined with image processing techniques can be
explored for disease detection and compared with other classification methods. A comparative study of
various deep CNNs can also be done.